Kristen Monaco, Praya Cheekapara, Raymond Fleming, Teng Ma
Code
library(readr)    # read_csv()
library(ggplot2)

data <- read_csv("All_threat_data.csv")

ggplot(data, aes(x = factor(Status), fill = factor(Status))) +
  geom_bar(show.legend = FALSE) +
  scale_fill_brewer(palette = "Paired") +
  labs(title = "Barplot of Status", x = "Status", y = "Frequency") +
  theme_minimal() +
  theme(text = element_text(size = 12),
        plot.title = element_text(hjust = 0.5),
        axis.title = element_text(size = 14, face = "bold"),
        axis.text.x = element_text(angle = 45, hjust = 1),
        panel.grid.major = element_line(color = "grey80"),
        panel.grid.minor = element_blank())
Random Forest Overview
Ensemble machine learning method in which a large number of decision trees vote to produce a classification
Benefits compared to decision tree:
Able to function with incomplete data
Lower likelihood of overfitting
Improved prediction accuracy
Bootstrap Sampling (Bagging)
Each decision tree uses a random sample of the original dataset
Using a subset of the dataset reduces the probability of an overfit model
Rows with missing data will often be left out of the sample, improving performance
Performed with replacement
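The sampling step above can be sketched in a few lines of base R. This is an illustrative example, not the project's code; the row count `n` is arbitrary.

```r
# Illustrative bootstrap (bagging) sample: draw n rows with replacement
set.seed(42)
n <- 150
boot_idx <- sample(seq_len(n), size = n, replace = TRUE)   # rows for one tree
oob_idx  <- setdiff(seq_len(n), boot_idx)                  # "out-of-bag" rows

# Because sampling is with replacement, some rows repeat and roughly
# a third of rows are left out of any one tree's sample
length(unique(boot_idx))
```

Each tree in the forest would receive its own `boot_idx`, and the out-of-bag rows can later be used to estimate that tree's error.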
Random Feature Selection
A random set of features is selected for each node in training
Information about feature importance may be saved and applied in future iterations
Even with automated random feature selection, feature selection and engineering prior to training may improve performance
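The per-node subsetting described above can be sketched as follows. The feature names are taken from the model formula later in the deck; the square-root-of-p subset size is the conventional default for classification forests, not necessarily what was used here.

```r
# Illustrative node-level feature selection: at each split, only a random
# subset of features (commonly sqrt(p) for classification) is considered
features <- c("LF", "GF", "Biomes", "Range", "Habitat_degradation",
              "Habitat_loss", "IAS", "Over_exploitation", "Unknown")
mtry <- floor(sqrt(length(features)))   # 3 of the 9 features per split

set.seed(1)
candidate_features <- sample(features, mtry)   # features tried at one node
```

In packages such as `randomForest` this subset size is controlled by the `mtry` argument rather than sampled by hand.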
Code
library(caret)    # trainControl(), train(), varImp()

ctrl <- trainControl(method = "cv", number = 10)

bagged_cv <- train(
  Group ~ LF + GF + Biomes + Range + Habitat_degradation + Habitat_loss +
    IAS + Other + Unknown + Over_exploitation,   # duplicate "Other" term removed
  data = species_train,
  method = "treebag",
  trControl = ctrl,
  importance = TRUE
)

plot(varImp(bagged_cv), 10)
Cross Validation
Validates model performance on held-out data
Resampling method similar to bootstrapping, but without replacement
Allows approximation of the general performance of a model
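The fold construction behind k-fold cross-validation can be sketched in base R; this mirrors what `trainControl(method = "cv", number = 10)` does internally. The row count is arbitrary for illustration.

```r
# Illustrative 10-fold cross-validation indices: each row is assigned to
# exactly one fold, so folds are drawn without replacement
set.seed(7)
n <- 100
k <- 10
fold <- sample(rep(seq_len(k), length.out = n))   # random fold label per row

# Rows in fold i form the validation set; the remaining rows train the model
val_rows_fold1 <- which(fold == 1)
```

Averaging the validation error across all k folds approximates the model's general performance.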
Code
library(rpart)        # rpart()
library(rpart.plot)   # rpart.plot()

m3 <- rpart(
  formula = Group ~ LF + GF + Biomes + Range + Habitat_degradation +
    Habitat_loss + IAS + Other + Unknown + Over_exploitation,   # duplicate "Other" removed
  data = species_train,
  method = "class"    # "class" for a categorical target; "anova" is for regression
)

rpart.plot(m3)
Prediction
Each trained decision tree produces its own prediction
Decision trees are independent, and were trained on different subsets of both data and features
Ensemble Voting
The results from each decision tree are combined into a voting classifier
The mode of the classification results will be the final prediction
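The voting step can be written as a small helper: the final prediction is the mode of the individual tree predictions. The class labels below are illustrative.

```r
# Illustrative majority vote: the final class is the most frequent
# prediction among the individual trees
majority_vote <- function(preds) {
  names(which.max(table(preds)))
}

tree_preds <- c("Threatened", "Threatened", "Not threatened", "Threatened")
majority_vote(tree_preds)   # "Threatened"
```

Because each tree saw a different bootstrap sample and feature subset, their errors are partly independent, which is why the vote tends to beat any single tree.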
Dataset
South African Red List
Data about plants with their habitat, traits, distribution, and factors influencing their current threatened/extinct status
Purpose
Predict whether or not an unknown plant is threatened based on the above characteristics
Visuals 1
Distribution Range
Code
ggplot(data = data, aes(x = Status, y = Range, fill = Status)) +
  geom_boxplot() +
  theme_bw() +
  ylim(0, 100000)
Visuals 2
Cramer’s V Association with Range binned into 20 categories
Target feature Group is most associated with Range, Family, Habitat Loss, Biome, and GF
The most associated features will likely be the most important features during model training
Collinearity does not appear to be present, though further checks are warranted
Code
library(dplyr)      # %>%, mutate(), ntile()
library(corrplot)
library(DescTools)

# Bin Range into 20 quantile-based categories to make it categorical
corrDFRange <- corrDFRange %>%
  mutate(Range = ntile(Range, n = 20))

# Pairwise Cramer's V association between all features
corrplot::corrplot(DescTools::PairApply(corrDFRange, DescTools::CramerV),
                   type = "lower")
Analysis
Five random forest models were created, each using a different normalization method
Data Preparation
Preprocessing
Encode categorical features into numerical / factor features
Split the data into training and test sets while avoiding class imbalance
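A stratified split like the one described above can be sketched in base R by sampling within each class; this is an illustrative example with made-up class sizes (`caret::createDataPartition` performs the same stratification).

```r
# Illustrative stratified 70/30 split: sample rows within each class so the
# train/test class proportions match the full data
set.seed(3)
status <- factor(rep(c("Threatened", "Not threatened"), times = c(70, 30)))

train_idx <- unlist(lapply(levels(status), function(cl) {
  rows <- which(status == cl)
  sample(rows, size = floor(0.7 * length(rows)))   # 70% of each class
}))
test_idx <- setdiff(seq_along(status), train_idx)
```

Both the training and test sets now contain the same 70/30 class mix as the original data.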
Class Imbalance
Resample the smaller classes so that all classes are approximately equal in size
Training on imbalanced datasets will bias predictions to the larger class
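The resampling step above can be sketched in base R by drawing extra rows from the minority class with replacement until the classes match; the class sizes here are made up for illustration (`caret::upSample` offers the same behavior).

```r
# Illustrative upsampling: resample the minority class with replacement
# until both classes are the same size
set.seed(9)
labels <- factor(rep(c("Threatened", "Not threatened"), times = c(80, 20)))

minority <- names(which.min(table(labels)))
extra <- sample(which(labels == minority),
                size = max(table(labels)) - min(table(labels)),
                replace = TRUE)
balanced <- c(seq_along(labels), extra)   # row indices of the balanced data

table(labels[balanced])   # both classes now have 80 rows
```

Without this step, a forest trained on the raw data would favor the majority class simply because it dominates every bootstrap sample.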